[Day 6] Data Preprocessing: Data Augmentation

2021 iThome Ironman Contest

DAY 6

AI & Data

Handwritten Chinese Character Image Recognition series, part 6

13th Ironman Contest

midnightla

2021-09-21 23:10:15


Current situation

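When recognizing handwritten Chinese characters, parts of a stroke may be missing from the image, or ink bleed may smudge the character. Either can cause the model to misrecognize it, as shown in the figure below.
When training an image recognition model, overfitting is nearly unavoidable and hurts recognition accuracy. The first remedy that usually comes to mind is adding more training samples, and Data Augmentation through image processing is one way to do that.
During data preprocessing, we therefore add salt-and-pepper noise to the images in the train dataset and perform Data Augmentation at the same time, turning 1 image into 4 (the original plus 3 noisy copies), to reduce the likelihood of the two problems above.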

Tools / packages

opencv
numpy
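random: after the image is divided into 25 regions, random.sample() randomly picks 3 of them to receive salt-and-pepper noise.
itertools: iterator utilities that work on iterables such as dict, list, tuple, and str; here product computes the Cartesian product.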
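To make the roles of itertools and random concrete, here is a minimal sketch of the region selection on its own, without any image. The helper name grid_offsets and the 120 x 160 image size are assumptions for illustration only.

```python
import random
from itertools import product

def grid_offsets(length, parts=5):
    """Start offsets that split `length` pixels into `parts` equal bands."""
    return [round(length * k / parts) for k in range(parts)]

# Assumed example size: 120 pixels high, 160 pixels wide
height, width = 120, 160

# Cartesian product of row offsets and column offsets: one entry per region
cells = list(product(grid_offsets(height), grid_offsets(width)))
print(len(cells))  # 25

# Pick 3 distinct regions (sampling without replacement)
random.seed(0)  # fixed seed only so the sketch is repeatable
picked = random.sample(cells, 3)
print(picked)
```

Each entry in picked is a (row offset, column offset) pair marking the top-left corner of one region.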

Content

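Salt-and-pepper noise (Salt & Pepper Noise): consists of salt noise and pepper noise. Salt noise refers to white specks (255) and pepper noise to black specks (0). When both appear in an image, they show up as black and white dots, as in the figure below.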

Image source: https://blog.csdn.net/u011995719/article/details/83375196
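For comparison with the definition above, the following is a minimal sketch of classic per-pixel salt-and-pepper noise in NumPy. It is not the code used later in this article, which masks whole blocks instead; the function name add_salt_pepper and the amount parameter are assumptions for illustration.

```python
import numpy as np

def add_salt_pepper(img, amount=0.05, rng=None):
    """Return a copy of `img` with roughly `amount` of its pixels set to 0 or 255."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = img.copy()
    # One random number per pixel position (shared across channels, if any)
    r = rng.random(img.shape[:2])
    noisy[r < amount / 2] = 0        # pepper: black specks
    noisy[r > 1 - amount / 2] = 255  # salt: white specks
    return noisy

# Usage on a synthetic mid-gray image
gray = np.full((100, 100), 128, dtype=np.uint8)
noisy = add_salt_pepper(gray, amount=0.1, rng=np.random.default_rng(0))
print(np.unique(noisy))
```

Half of the affected pixels become pepper and half become salt, so amount controls the overall noise density.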

Data Augmentation
2.1 Reading image files from paths that contain Chinese characters

import os
import cv2
import numpy as np
import random
from itertools import product

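# Read an image file. The raw bytes are loaded with NumPy and decoded in memory,
# so the path may contain Chinese characters.
def cv_imread(filePath):
    cv_img = cv2.imdecode(np.fromfile(filePath, dtype=np.uint8), cv2.IMREAD_UNCHANGED)
    return cv_img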
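The reason for going through np.fromfile is that cv2.imread is commonly reported to fail on non-ASCII paths, on Windows in particular, while NumPy reads the bytes through Python's own file handling. The sketch below shows only the byte-level half of that trick, with no image decoding: bytes written with tofile() to a path containing Chinese characters come back unchanged through np.fromfile(). The file name is a hypothetical example.

```python
import os
import tempfile
import numpy as np

payload = bytes(range(256))  # stand-in for the bytes of an encoded image

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name containing Chinese characters
    path = os.path.join(tmp, '手寫_測試.bin')
    # Write raw bytes with ndarray.tofile(), as the article does after cv2.imencode()
    np.frombuffer(payload, dtype=np.uint8).tofile(path)
    # Read them back as a flat uint8 buffer, as cv_imread does before cv2.imdecode()
    buf = np.fromfile(path, dtype=np.uint8)

print(buf.shape, buf.dtype)  # (256,) uint8
```

In cv_imread the resulting buffer is handed to cv2.imdecode, which never sees the path at all.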

2.2 Generating salt-and-pepper noise

# Randomly cover parts of the image with noise blocks
def random_mask(filename):
    # Read the image
    img = cv_imread(filename)
    img_high = img.shape[0]
    img_width = img.shape[1]

    # Split the image height into 5 equal parts (row offsets)
    img_high1, img_high2, img_high3, img_high4 = round(img_high*0.2), round(img_high*0.4), round(img_high*0.6), round(img_high*0.8)
    img_high_list = [0, img_high1, img_high2, img_high3, img_high4]

    # Split the image width into 5 equal parts (column offsets)
    img_width1, img_width2, img_width3, img_width4 = round(img_width*0.2), round(img_width*0.4), round(img_width*0.6), round(img_width*0.8)
    img_width_list = [0, img_width1, img_width2, img_width3, img_width4]

    # Top-left corners of the 5*5 blocks, as (row offset, column offset) pairs
    combine = list(product(img_high_list, img_width_list))

    # Randomly pick 3 blocks (sampling without replacement)
    masks = random.sample(combine, 3)
    print('masks:', masks)
    print('shape', img.shape)

    # Size of one block (1/5 of each dimension)
    mask_high = round(img_high/5)
    mask_width = round(img_width/5)
    for (row, col) in masks:
        # OpenCV points are (x, y) = (column, row)
        top_left = (col, row)
        # Clamp the bottom-right corner to the image bounds
        bottom_right = (min(col + mask_width, img_width), min(row + mask_high, img_high))
        print('top_left', top_left)
        print('bottom_right', bottom_right)
        # Fill the block with a solid color (black is 0, white is 255).
        # A bare 255 is white only on a grayscale image; use (255, 255, 255) for 3 channels.
        # cv2.rectangle(img, top_left, bottom_right, 255, -1)
        cv2.rectangle(img, top_left, bottom_right, 0, -1)
    print('-' * 20)
    return img
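The same block masking can also be sketched on an in-memory array with NumPy slicing, which makes it easy to test without reading files or opening windows. This is an alternative sketch, not the article's function: the name mask_blocks and its parameters are assumptions, it returns a copy instead of modifying the input, and it fills exactly one grid cell per block.

```python
import random
from itertools import product
import numpy as np

def mask_blocks(img, n_blocks=3, parts=5, value=0, rng=random):
    """Return a copy of `img` with `n_blocks` cells of a parts x parts grid filled with `value`."""
    out = img.copy()
    h, w = out.shape[:2]
    # Cell boundaries along each axis, including the far edge
    rows = [round(h * k / parts) for k in range(parts + 1)]
    cols = [round(w * k / parts) for k in range(parts + 1)]
    # All (row index, column index) cells; sample without replacement
    cells = list(product(range(parts), range(parts)))
    for r, c in rng.sample(cells, n_blocks):
        out[rows[r]:rows[r + 1], cols[c]:cols[c + 1]] = value
    return out

# Usage on a synthetic white image; random.Random(0) makes the choice repeatable
white = np.full((120, 160), 255, dtype=np.uint8)
masked = mask_blocks(white, rng=random.Random(0))
print((masked == 0).sum())  # 2304 = 3 blocks of 24 x 32 pixels
```

Passing a seeded random.Random instance keeps the masks reproducible between runs, which helps when comparing training results.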

2.3 Displaying the masked result

# Display an image in a resizable window and wait for a key press
def show_img(name, img):
    cv2.namedWindow(name, cv2.WINDOW_NORMAL)
    cv2.resizeWindow(name, 160, 120)
    cv2.imshow(name, img)
    cv2.waitKey()

2.4 Augmenting each image into 4 images and saving the new files

if __name__ == '__main__':
    # Image directories
    src_dir_name = './train/'
    target_dir_name = './test/'  # not used below; new files are saved next to the originals

    # Generate randomly masked copies and save them as new files
    for folder in os.listdir(src_dir_name):
        sub_folder = src_dir_name + folder + '/'
        for name in os.listdir(sub_folder):
            print(sub_folder + name)
            # Generate 3 new images, each with its own random noise blocks
            # (cv2.resize takes the target size as (width, height))
            img1 = random_mask(sub_folder + name)
            img1 = cv2.resize(img1, (160, 120), interpolation=cv2.INTER_CUBIC)
            img2 = random_mask(sub_folder + name)
            img2 = cv2.resize(img2, (160, 120), interpolation=cv2.INTER_CUBIC)
            img3 = random_mask(sub_folder + name)
            img3 = cv2.resize(img3, (160, 120), interpolation=cv2.INTER_CUBIC)
            # Save the noisy images as new files (tofile handles Chinese paths)
            cv2.imencode('.jpg', img1)[1].tofile(sub_folder + '1_' + name)
            cv2.imencode('.jpg', img2)[1].tofile(sub_folder + '2_' + name)
            cv2.imencode('.jpg', img3)[1].tofile(sub_folder + '3_' + name)
            print('=' * 50)
            # Blocks until a key is pressed; comment out for batch runs
            show_img('name', img1)
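Because the script above writes the new files into the same folders it reads from, running it twice would also augment the already augmented copies. One possible way around this, sketched below as an assumption and not as what this article does, is to plan the output paths under a separate directory first. The function name plan_outputs and the directory layout (one sub-folder per class) are hypothetical.

```python
import os

def plan_outputs(src_dir, dst_dir, copies=3):
    """Yield (source_path, [output_paths]) for each file one level below src_dir."""
    for label in sorted(os.listdir(src_dir)):
        sub_folder = os.path.join(src_dir, label)
        if not os.path.isdir(sub_folder):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(sub_folder)):
            # Same naming scheme as above: 1_<name>, 2_<name>, 3_<name>
            outputs = [os.path.join(dst_dir, label, f'{k}_{name}')
                       for k in range(1, copies + 1)]
            yield os.path.join(sub_folder, name), outputs
```

Each source path could then be passed to random_mask once per output path, after creating the target folder with os.makedirs(..., exist_ok=True).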

Summary

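Training a good image recognition model requires a high-quality dataset, which is why we spent so much time on data preprocessing. After 4 days of data preprocessing, we ended up with a new dataset of the following size.
1.1 train dataset: 174,808 images
1.2 test dataset: 18,717 images
To do a good job, one must first sharpen one's tools! Next stop: the "preliminary setup" for model training. We will share how to install the tensorflow framework locally and train the model on a GPU to make training more efficient.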

Let's keep going...